This article summarizes how to structure a Python software project so that releasing it as a package becomes easier. The project structure and the configuration of the pyproject.toml file are described in detail. Finally, a quick introduction shows how to build and publish a distribution, with links to documentation on automating this process. Further links are provided for reference regarding documentation, CI/CD, unit testing and related issues.
Measures:
The reference implementation is written, documented and released. It is currently being maintained. Future work may include extending the application from pure simulation to including test rigs. Further RDM features are planned, including reproducibility from given results data (re-instantiation and re-simulation) and creating semantic graphs for describing the structure of resulting data written to HDF5 format.
SOFIRpy: Full implementation of all the things (and more) discussed in this article.
When it comes to sharing Python code with others, packaging is an essential step. Python packaging involves creating a distributable package that can be easily installed and used by others. A package in Python is simply a directory that contains one or more Python files (modules), as well as an __init__.py file. Packages can be nested, meaning that a package can contain sub-packages, which can in turn contain more sub-packages or modules. This allows complex projects to be organized hierarchically, making the code easier to manage and understand.
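The nesting described above can be tried out with a minimal sketch. It builds a throwaway package in a temporary directory and imports a module from the package and from a sub-package. The names demo_pkg, subpkg, greet and answer are hypothetical and chosen only for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def create_demo_package(root: Path) -> None:
    """Create demo_pkg/ (hypothetical name) with one module and one sub-package."""
    pkg = root / "demo_pkg"
    sub = pkg / "subpkg"
    sub.mkdir(parents=True)
    # The __init__.py files mark the directories as regular packages.
    (pkg / "__init__.py").write_text("", encoding="utf-8")
    (pkg / "module1.py").write_text(
        "def greet():\n    return 'hello from module1'\n", encoding="utf-8"
    )
    (sub / "__init__.py").write_text("", encoding="utf-8")
    (sub / "module2.py").write_text(
        "def answer():\n    return 42\n", encoding="utf-8"
    )


with tempfile.TemporaryDirectory() as tmp:
    create_demo_package(Path(tmp))
    sys.path.insert(0, tmp)  # make the temporary directory importable
    importlib.invalidate_caches()
    try:
        module1 = importlib.import_module("demo_pkg.module1")
        module2 = importlib.import_module("demo_pkg.subpkg.module2")
        greeting = module1.greet()
        answer = module2.answer()
    finally:
        sys.path.remove(tmp)

print(greeting, answer)
```

Manipulating sys.path by hand is only done here to keep the sketch self-contained. For a real project, installing the package is the cleaner route.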
The packaging process in Python can be complex and challenging, with several different tools available, each with its own strengths and weaknesses. Some of the more popular ones are setuptools, Poetry, Hatch, Flit and PDM.
In order not to be overwhelmed by the multitude of options, this article will focus on the process of packaging a Python project from start to finish using setuptools.
The first steps in the packaging process are the creation of a virtual environment (technically this is optional but it makes life easier) and the setup of the project structure.
The virtual environment can be created using the following command.
Windows:
py -m venv venv
Mac:
python3 -m venv venv
After creating the virtual environment it needs to be activated:
Windows:
venv\Scripts\activate
Mac:
source venv/bin/activate
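Whether the activation worked can be checked from within Python. The following is a minimal sketch, assuming an environment created with the standard venv module (recent virtualenv releases behave the same way). The function name in_virtual_environment is hypothetical.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True if the running interpreter belongs to a virtual environment.

    Inside a venv, sys.prefix points at the environment directory, while
    sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != sys.base_prefix


print(f"prefix: {sys.prefix}")
print(f"inside virtual environment: {in_virtual_environment()}")
```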
A Python project is usually set up in one of the following two layouts:
flat-layout: The package is located in a folder with the same name as the project name:
<project_name>
│
└───<project_name>
│   │   __init__.py
│   │   module1.py
│   │
│   └───subfolder1
│       │   __init__.py
│       │   module2.py
│       │   ...
│
└───tests
│
│   LICENSE
│   pyproject.toml
│   README.md
src-layout: The folder with the package is inside another folder called src. The src-layout requires an installation of the package to be able to run its code. After configuring the pyproject.toml file (see section Configuration of pyproject.toml), the package can be installed locally by running the following command:
pip install -e .
This performs an editable install of the package. Detailed information about editable installs can be found here.
<project_name>
│
└───src
│   └───<project_name>
│       │   __init__.py
│       │   module1.py
│       │
│       └───subfolder1
│           │   __init__.py
│           │   module2.py
│           │   ...
│
└───tests
│
│   LICENSE
│   pyproject.toml
│   README.md
NOTE: The tests folder should not contain an __init__.py file since it is not considered a package. To be able to run the tests without including the __init__.py or manually adding the tests directory path to sys.path, the project needs to be installed first, even for the flat-layout.
The differences between these approaches are discussed here and here. The src-layout is considered best practice because it has several advantages over the flat layout. Therefore, the following example assumes a src layout.
The pyproject.toml file is a configuration file used to define project metadata, dependencies, build configurations and tool configurations. It was first introduced in PEP 518 in order to allow more flexibility and control over the packaging process.
To define the build tool, the pyproject.toml file must contain a build-system table. If setuptools is to be used as the build system, the table must be defined as follows.
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"
The pyproject.toml file must also contain a project table. This is where project metadata and dependencies are defined. Detailed information about this can be found in PEP 621 and here. setuptools-specific information can be found here.
[project]
name = "<package_name>" # required
version = "0.0.1" # required, can be substituted (dynamic = ["version"]) if version is defined dynamically
authors = [
    { name="Example Author", email="author@example.com" }, # optional
]
description = "package description" # optional
keywords = ["keyword1", "keyword2"] # optional
readme = "README.md" # optional
requires-python = ">=3.7" # optional
license = {file = "LICENSE"}
classifiers = [ # optional
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
    "Operating System :: OS Independent",
]
dependencies = [ # optional
    "pandas",
    "numpy>=1.20.0",
]

[project.optional-dependencies] # optional
dev = [
    "black>=22.12.0",
    "isort>=5.12.0",
    "pylint>=2.15.3"
]
test = ["pytest>=7.1.2"]
docs = ["sphinx>=5.1.1", "sphinx-rtd-theme>=1.0.0"]

[project.urls] # optional
homepage = "https://example.com"
documentation = "https://readthedocs.org"
repository = "https://github.com/me/example_project.git"

[project.scripts] # optional
example-cli = "<package_name>:main"
name
: Defines the name of the project. There are some specifications on what makes a valid name here. If the project is to be uploaded to PyPI, the name must not already be taken there.
version
: Defines the package version. The version should comply with PEP 440. Some tools, such as setuptools, allow this field to be defined dynamically. See here for more information. Defining the version number dynamically is preferable, as it prevents discrepancies between the version number defined in the package and the one defined in the configuration file.
authors
: A list of authors.
description
: One-sentence summary of the package.
keywords
: Defines keywords describing the project. These keywords will appear on the project page on PyPI.
readme
: Defines the relative path to the README. It is usually located in the same directory as the pyproject.toml file, so specifying only the file name is sufficient. If the package is uploaded to PyPI, the content of the README is shown on the package page.
requires-python
: Defines the Python versions supported by the package.
license
: Defines the relative path to the license file.
classifiers
: Defines additional metadata about the package. More information can be found here.
dependencies
: Defines core dependencies of the package. These will be automatically downloaded and installed when the package is installed. More information can be found here. Specifying a certain version number/interval is encouraged since it can help to ensure stability and reproducibility of the package.
optional-dependencies
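: Allows dependencies to be defined that are not installed by default. This has two typical use cases: optional features of the package that need extra dependencies, and tooling that is only needed during development, such as the dev, test and docs groups in the example.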
Since the optional dependencies are not installed by default, the following command needs to be run to install the project locally with optional dependencies.
pip install -e ".[test, docs, dev]"
urls
: Defines a number of URLs to show on PyPI.
scripts
: Allows scripts/functions within the package to be made available as command-line tools. More information can be found here.
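The scripts entry example-cli = "<package_name>:main" refers to a callable named main in the top-level __init__.py of the package. What such a function might look like is sketched below. The --name option and the greeting are purely illustrative assumptions and are not part of the original project.

```python
import argparse
from typing import Optional, Sequence


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Hypothetical entry point for the example-cli command.

    When argv is None, argparse reads the arguments from sys.argv,
    which is what happens when the installed command is invoked.
    """
    parser = argparse.ArgumentParser(
        prog="example-cli", description="Greets the given name."
    )
    parser.add_argument("--name", default="world", help="who to greet")
    args = parser.parse_args(argv)
    print(f"Hello, {args.name}!")
    return 0  # the return value is used as the exit code of the command


# Calling the function directly with an explicit argument list:
exit_code = main(["--name", "PyPI"])
```

Accepting an optional argument list keeps the function easy to call from unit tests without touching sys.argv.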
setuptools allows optional tool-specific configuration to be specified under the [tool.setuptools] table. See here for detailed information. If the version is specified dynamically, setuptools needs to know where to get the version number from. A common practice is to define the version of the package within the top-level __init__.py file:
__version__ = "0.0.1"
Inside the pyproject.toml file the following needs to be defined:
[tool.setuptools.dynamic]
version = {attr = "<project_name>.__version__"}
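Once the package is installed, the version recorded in its metadata can be read back at runtime, which is a simple way to confirm that the dynamic version was picked up. The following is a minimal sketch using importlib.metadata from the standard library (Python 3.8+). The helper name get_installed_version and the distribution name example_project are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def get_installed_version(distribution_name: str, default: str = "unknown") -> str:
    """Return the installed version of a distribution, or a default value.

    The lookup uses the metadata written at install time, so it only
    succeeds for packages that are actually installed in the environment.
    """
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return default


# "example_project" is a placeholder; the result is "unknown" if it is not installed.
print(get_installed_version("example_project"))
```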
When the src-layout is used as the project structure, setuptools automatically detects the package's location. If the automatic discovery fails, the package location can be specified explicitly. Detailed information can be found here.
[tool.setuptools.packages.find]
where = ["src"]
Information on how to include specific data files inside the package can be found here.
In Python packaging, there are two different types of distributions: source distributions (sdist) and built distributions.
Source distributions are packages that contain the source code for your project, as well as any supporting files like documentation, configuration files, and READMEs. These packages are intended to be built and installed on the target system by the end user. Source distributions can be built using tools like setuptools, and are typically distributed as tarballs (.tar.gz).
Built distributions also contain metadata but are pre-built packages that can be installed directly on the target system. Built distributions can be built using tools like setuptools and are typically distributed as wheel (.whl) files (introduced with PEP 427).
Newer versions of pip prioritize the installation of built distributions, such as wheels, over source distributions. If a built distribution is available for the target system, pip will prefer to install it, as it is faster and more efficient than building from source. However, if a built distribution is not available for the target system, or if there is a compatibility problem, pip will fall back to installing the source distribution. This behavior helps to ensure that packages are installed in the most efficient and compatible way.
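Whether a wheel fits a target system is encoded in its file name, which according to PEP 427 has the form {distribution}-{version}(-{build tag})?-{python tag}-{abi tag}-{platform tag}.whl. The following is a simplified sketch of splitting such a name into its parts. It does not implement the full compatibility check that pip performs, and the names WheelName, parse_wheel_filename and is_pure_python are hypothetical.

```python
from typing import NamedTuple, Optional


class WheelName(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelName:
    """Split a wheel file name into the components defined by PEP 427."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:  # no optional build tag
        dist, ver, py, abi, plat = parts
        build = None
    elif len(parts) == 6:  # with optional build tag
        dist, ver, build, py, abi, plat = parts
    else:
        raise ValueError(f"unexpected number of components in {filename!r}")
    return WheelName(dist, ver, build, py, abi, plat)


def is_pure_python(wheel: WheelName) -> bool:
    """A wheel tagged none/any has no ABI or platform restriction."""
    return wheel.abi_tag == "none" and wheel.platform_tag == "any"


wheel = parse_wheel_filename("example_project-0.0.1-py3-none-any.whl")
print(wheel, is_pure_python(wheel))
```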
The first step in generating the distribution is to make sure that PyPA's build is installed and up to date:
pip install --upgrade build
Next, the following command needs to be run in the same directory where the pyproject.toml file is located:
python -m build
This command should generate two files in the dist directory: a source distribution (.tar.gz) and a wheel (.whl).
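The content of the dist directory can be inspected with a few lines of Python, for example before uploading. The following is a minimal sketch that sorts the files by distribution type. The helper name classify_dist_files and the file names are hypothetical, and a temporary directory with empty placeholder files stands in for a real dist directory.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def classify_dist_files(dist_dir: Path) -> Dict[str, List[str]]:
    """Group the files in a dist directory into sdists, wheels and others."""
    result: Dict[str, List[str]] = {"sdist": [], "wheel": [], "other": []}
    for path in sorted(dist_dir.iterdir()):
        if path.name.endswith(".tar.gz"):
            result["sdist"].append(path.name)
        elif path.suffix == ".whl":
            result["wheel"].append(path.name)
        else:
            result["other"].append(path.name)
    return result


with tempfile.TemporaryDirectory() as tmp:
    dist = Path(tmp)
    # Empty placeholder files with hypothetical names.
    (dist / "example_project-0.0.1.tar.gz").touch()
    (dist / "example_project-0.0.1-py3-none-any.whl").touch()
    found = classify_dist_files(dist)

print(found)
```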
Detailed information on how to upload the distribution to PyPI can be found here.
The build process and upload to PyPI can be automated by including these steps in the CI/CD pipeline. See GitLab's documentation for more information. A sample implementation can be found here.
Packaging software modules written for specific research applications is a good idea. It allows referencing specific versions used for generating specific research results. Packages are inherently modular and can be combined with other packages for building bespoke environments for research projects. The software development process towards a distributed package encourages further best practices of RDM, such as documentation, version control, unit testing and CI/CD.
The authors would like to thank the Federal Government and the Heads of Government of the Länder, as well as the Joint Science Conference (GWK), for their funding and support within the framework of the NFDI4Ing consortium. Funded by the German Research Foundation (DFG) - project number 442146713.